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Abstract 

We introduce a new multilevel domain decomposition method (MDD) for elec- 
tronic structure calculations within semi-empirical and Density Functional Theory 
(DFT) frameworks. This method iterates between local fine solvers and global 
coarse solvers, in the spirit of domain decomposition methods. Using this approach, 
calculations have been successfully performed on several linear polymer chains con- 
taining up to 40,000 atoms and 200,000 atomic orbitals. Both the computational 
cost and the memory requirement scale linearly with the number of atoms. Addi- 
tional speed-up can easily be obtained by parallelization. We show that this do- 
main decomposition method outperforms the Density Matrix Minimization (DMM) 
method for poor initial guesses. Our method provides an efficient preconditioner for 
DMM and other linear scaling methods, variational in nature, such as the Orbital 
Minimization (OM) procedure. 
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1 Introduction and motivation 

A central issue in computational quantum chemistry is the determination of the electronic 
ground state of a molecular system. For completeness and self-consistency, we now briefly 
introduce the problem. In particular, we present it in a mathematical way. 



1.1 Standard electronic structure calculations 

A molecular system is composed of N electrons, modelled quantum mechanically, and a 
given number of nuclei, the latter being considered as classical point-like particles clamped 
at known positions (Born-Oppenheimer approximation). We refer to 7] for a general 
mathematical exposition and to [T^ IT^ for the chemical background. Determining the 
electronic ground state amounts to solving a time-independent Schrodinger equation in 
R^^. This goal is out of reach for large values of A^. In fact it is already infeasible for 
values of A^ exceeding three or four, unless dedicated techniques are employed. Examples 
are stochastic-like techniques such as Diffusion Monte-Carlo approaches, or emerging tech- 
niques, such as sparse tensor products techniques |T7] . Approximations of the Schrodinger 
equation have been developed, such as the widely used tight-binding, Hartree-Fock and 
Kohn-Sham models. For these three models, the numerical resolution of a problem of the 
following type is required: given H and S, respectivement an Nf, x Nb symmetric matrix 
and an A^;, x A^;, symmetric positive definite matrix (with Nb > N), compute a solution 
of the problem 

Hci = eiSci, ei < . . . < Cat < e^+i < ■ ■ ■ < cat,, 

c-S'ci = Si 



N 



1.11 



Let us mention that most electronic structure calculations are performed with closed shell 
models jT2]; and that, consequently, the integer A^ in ()1.1|) then is the number of electron 
pairs. We remark that when S is the identity matrix, a solution to ()1.1|) is a solution 
to the problem 

Find the orthogonal projector on the space spanned by the A^ eigenvectors ,^ 
associated with the lowest A^ eigenvalues of if. 

In (II. 2p . and throughout this article, the eigenvalues are counted with their multiplicities. 
The A^ eigenvectors q, called generalized eigenvectors in order to emphasize the presence 
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of the matrix S, represent the expansion in a given Galerkin basis {Xi}i<i<Nt 

N one-electron wavef unctions. The matrix if is a mean-field Hamiltonian matrix. For 

instance, for the Kohn-Sham model, we have 

H^J = \ I Vx. ■ Vx, + / VxrX, (1-3) 

where ^ is a mean-field local potential. The matrix S is the overlap matrix associated 
with the basis {Xi}i<i<iv,: 

S^J = [ XrXj- (1.4) 

In this article, we focus on the Linear Combination of Atomic Orbitals (LCAO) approach. 
This is a very efficient discretization technique, using localized basis functions {Xi}, com- 
pactly supported [22] or exhibiting a gaussian fall-off [12] . 

It is important to emphasize what makes the electronic structure problem, discretized 
with the LCAO approach, specific as compared to other linear eigenvalue problems en- 
countered in other fields of the engineering sciences (see [21(131 for instance). First, A^^ is 
proportional to A^, and not much larger than it (say A";, ~ 2N to fix the ideas). Hence, the 
problem is not finding a few eigenvectors of the generalized eigenvalue problem (jl.ll) . Sec- 
ond, although the matrices H and S are sparse for large molecular systems (see section IT!^ 
for details), they are not as sparse as the stiffness and mass matrices usually encountered 
when using finite difference or finite element methods. For example, the bandwith of H 
and S is of the order of 10^ in the numerical examples reported in section [^ Note that, 
in contrast, for plane wave basis set discretizations (which will not be discussed here), 
the parameter A^;, is much larger than A^ (say A";, ~ 100 A^), the matrix S is the identity 
matrix and the matrix H est full. Third, and this is a crucial point, the output of the cal- 
culation is the matrix and not the generalized eigenvectors Cj themselves. This is the 
fundamental remark allowing the construction of linear scaling methods (see section IT!^ . 

A solution of ((HH) is 

= C.Cl (1.5) 
where is a solution to the minimization problem 

inf|Tr(^i7CC*), C e M^''^ {"R), C SC = (1.6) 

Note that the energy functional Tr ^ifCC*^ can be given the more symmetric form 
TxiC^HC ). Here and below, Al'^'' denotes the vector space of the k x / real matrices. 
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Notice that has many minimizers: if is a minimizer, so is C*t/ for any orthogonal 
N X N matrix U. However, under the standard assumption that the A^-th eigenvalue of 
H is strictly lower than the (A^ + l)-th one, the matrix D* defined by ()1.5|) does not in 
fact depend on the choice of the minimizer C^, of p.6p . Notice also that (jl.lj) are not the 
Euler-Lagrange equations of ()1.6|1 but that any critical point of ()1.6|1 is obtained from a 
solution of (jl.lll by an orthogonal transformation of the columns of Cj, = (ci| ■ ■ ■ \cn)- 

The standard approach to compute D^, is to solve the generalized eigenvalue prob- 
lem (jl.lj) and then construct thus by collecting the lowest N generalized eigenvec- 
tors of H. This approach is employed when the number N of electrons (or electron pairs) 
is not too large, say smaller than 10^. 

1.2 Linear scaling methods 

One of the current challenges of Computational Chemistry is to lower the computational 
complexity A^^ of this solution procedure. A linear complexity N is the holy grail. There 
are various existing methods designed for this purpose. Surveys on such methods are 
inmi- Our purpose here is to introduce a new method, based on the domain decomposi- 
tion paradigm. We remark that the method introduced here is not the first occurrence of 
a method based on a decomposition of the matrix H but a significant methodologi- 
cal improvement is fulfilled with the present method. To the best of our knowledge, such 
methods only consist of local solvers complemented by a crude global step. The method 
introduced below seems to be the first one really exhibiting the local/global paradigm in 
the spirit of methods used in other fields of the engineering sciences. Numerical oberva- 
tions confirm the major practical interest methodological improvement. 

Why is a linear scaling plausible for computing D^? To justify the fact that the cubic 
scaling is an estimate by excess of the computational task required to solve (jl.lj) . we 
argue that the matrix does not need to be diagonalized. As mentioned above, only the 
orthogonal projector on the subspace generated by the lowest eigenvectors is to be 
determined and not the explicit values of these lowest A^ eigenvectors. But in order to 
reach a linear complexity, appropriate assumptions are necessary, both on the form of the 
matrices H and S, and on the matrix solution to (jl.lj) : 

• (HI). The matrices H and S are assumed sparse, in the sense that, for large systems, 
the number of non-zero coefficients scales as A^. This assumption is not restrictive. 
In particular, it follows from (jl.3j) and (jl.4j) that it is automatically satisfied for 
Kohn-Sham models as soon as the basis functions are localized in real space, which 
is in particular the case for the widely used atomic orbital basis sets j7j; 
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• (H2). A second assumption is that the matrix buih from the solution to p.lll is 
also sparse. This condition seems to be fulfilled as soon as the relative gap 



deduced from the solution of (jl.ip is large enough. As explained in section |21 below, 
this observation can be supported by qualitative physical arguments. On the other 
hand, we are not aware of any mathematical argument of linear algebra that would 
justify assumption (H2) in a general setting. 

We assume (H1)-(H2) in the following. Current efforts aim at treating cases when the 
second assumption is not fulfilled, which in particular corresponds to the case of conduct- 
ing materials. The problem ()1.2j) is then extremely difficult because the gap 7 in ()1.7p 
being very small, the matrix D is likely to be dense. Reaching linear complexity is then a 
challenging issue, unsatisfactorily solved to date. State of the art linear scahng methods 
presented in the literature experience tremendous difficulties (to say the least) in such 
cases. It is therefore reasonable to improve in a first step the existing methods in the 
setting of assumption (H2), before turning to more challenging issues. 

Before we get to the heart of the matter, we would like to point out the following 
feature of the problem under consideration. 

In practice. Problem has to be solved repeatedly. For instance, it is the inner loop 
in a nonlinear minimization problem where H depends self-consistently on D^. We refer to 
|Hl El for efficient algorithms to iterate on this nonlinearity and to [7 for a review on the 
subject. Alternatively, or in addition to the above, problem is parametrized by the 
positions of the nuclei (both the mean-field operator H and the overlap matrix S indeed 
depend on these positions), and these positions may vary. This is the case in molecular 
mechanics (find the optimal configuration of nuclei that gives the lowest possible energy 
to the molecular system), and in molecular dynamics as well (the positions of nuclei 
follow the Newton law of motion in the mean-field created by the electrons). In either 
case, problem ()1.H) is not be solved from scratch. Because of previous calculations, we 
may consider we have at our disposal a good initial guess for the solution. The latter 
comes from e.g. previous positions of nuclei, or previous iterations in the outer loop 
of determination of H. In difficult cases it may even come from a previous computation 
with a coarse grained model. In other words, the question addressed reads solving Problem 
m.l}} for some H + 6H and S + 6S that are small perturbations of previous H and S for 
which the solution is known. This specific context allows for a speed up of the algorithm 
when the initial guess is sufficiently good. This is the reason why, in the following, we 
shall frequently make distinctions between bad and good initial guesses. 
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2 Localization in Quantum Chemistry 

The physical system we consider is a long linear molecule (for instance a one- dimensional 
polymer or a nanotube). Let us emphasize that we do not claim a particular physical 
relevance of this system. This is for the purpose of illustration. We believe the system 
considered to be a good representative of a broad class of large molecular systems that 
may be encountered practically. Each atomic orbital Xi is centered on one nucleus. Either 
it is supported in a ball of small radius [201 (i^ comparison to the size of the macromolecule 
under study), or it has a rapid exponential- like or Gaussian-like [TU] fall-off. The atomic 
orbitals are numbered following the orientation of the molecule. Then, the mean-field 
Hamiltonian matrix H whose entries are defined by has the band structure shown 
in Figure ^ 

Although the eigenvectors of H are a priori delocalized (most of their coefficients 
do not vanish), it seems to be possible to build a S-orthonormal basis of the subspace 
generated by the lowest eigenvectors of H, consisting of localized vectors (only a few 
consecutive coefficients are non zero). This is motivated by a physical argument of locality 
of the interactions [THj. For periodic systems, the localized vectors correspond to the so- 
called Wannier orbitals jl]. It can be proven that in this case, the larger the band gap, the 
better the localization of the Wannier orbitals jT3]. For insulators, the Wannier orbitals 
indeed enjoy an exponential fall-off rate proportional to the band gap. For conductors, 
the fall-off is only algebraic. As mentioned in the introduction, we only consider here the 
former case. This allows us to assume that there exists some integer q <C A^^, such that 
Nb/q is an integer, for which all of these localized functions can be essentially expanded 
on q consecutive atomic orbitals. Denoting by n = 2q, we can therefore assume a good 
approximation of a solution C^ to ()1.6|) exists, with the block structure displayed on Fig- 
ure |21 Note that each block Cj only overlaps with its nearest neighbors. Correspondingly, 
we introduce the block structure of H displayed on Figure |21 The matrix D constructed 
from a block matrix C using ()1.5|) has the structure represented in Figure 0] and satisfies 
the constraints D = D\ = D, Tt{D) = N. 

Let us point out that the integers q and n = 2q depend on the band gap, not on the 
size of the molecule. The condition n = 2g is only valid for S = In^. For S ^ In,,, it is 
replaced hj n = 2q + nhs where 2 nhs — 1 is the bandwidth of the matrix 5*. 

The domain decomposition algorithm we propose aims at searching an approximate 
solution to that has the block structure described above. 

For simplicity, we now present our method assuming that S = In^, i-e. that the 
Galerkin basis {Xi}i<i<N i^ orthonormal. The extension of the method to the case when 
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S 7^ In,, is straightforward. Problem then reads 

inf|Tr(^iJCC*), C E 7W^'"^(R), C'C = I 



(2.1) 



Our approach consists in solving an approximation of problem ()2.H1 obtained by min- 
imizing the exact energy Tr^i7CC*j on the set of the matrices C which have the block 

structure displayed on Figure El and satisfy the constraint C*C = In- The resulting 
minimization problem can be recast as 



r ^ 

inf J2 i^^C.Cj) , Q G A<"'"^»(R), 
i=i 



m^GlN, ClCi = Im, ^l<^<P, 



ClTCi+i = Vl<i<p-1, ^mi = N 

i=l 

In the above formula, T G A^'"'"(R) is the matrix defined by 

y ^ r 1 if A; - / = g 
^' 1^ otherwise 

and Hi G A1"'"(R) is a symmetric submatrix of H (see Figure El). Indeed, 
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In this way, we replace the ^^^^"^-^ global scalar constraints C*C = Ij^ involving vectors 

of size Njj, by the Yl^=i local scalar constraints C*Cj = and the Yl^=i ^I'^^i+i 

local scalar constraints C*TCi+i = 0, involving vectors of size n. We would like to em- 
phasize that we can obtain in this way a basis of the vector space generated by the lowest 
N eigenvectors of if, but not the eigenvectors themselves. This method is therefore not 
directly applicable to standard diagonalization problems. 

Our algorithm searches for the solution to (j2.2p . not to (|2.ip . More rigorously stated, 
we search for the solution to the Euler-Lagrange equations of (|2.2p : 

H,C, = a^, + T*a_iA,_i,, + TC,+iA*,+i l<z<P, 
ClC = Im, l<i<P, (2.4) 

ClTC,+i = l<i<p-l, 



where by convention 

Co = Cp+i = 0. (2.5) 

The matrices (_Ej)i<j<p and (Aj j_|.i)i<j<p_i respectively denote the matrices of Lagrange 
multipliers associated with the orthonormality constraints C^Ci = Im^ and C*TCj+i = 0. 
The rrii x rrii matrix Ei is symmetric. The matrix Aj j+i is of size rrii x rrii^i. The above 
equations can be easily derived by considering the Lagrangian 



C {{C,} , {E,} , {A,,+,}) = 5^Tr(i/,aC*)+5^Tr((C*a-/„jE.) 

1=1 1=1 
p-i 

+ 5^Tr(CtTa+i4,+0- 



1=1 



The block structure imposed on the matrices clearly lowers the dimension of the search 
space we have to explore. However, this simplification comes at a price. First, prob- 
lem (j2.2j) only approximates problem (j2.H) . Second, (j2.2|) may have local, non global, 
minimizers, whereas all the local minimizers of (j2.H) are global. There are thus a priori 
many spurious solutions of the Euler Lagrange equations ()2.4p associated with ()2.2|) . 

A point is that the sizes (mj)i<j<p are not a priori prescribed. In our approach, they 
are ajusted during the iterations. We shall see how in the sequel. 
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3 Description of the domain decomposition algorithm 

3.1 Description of a simplified form 

For pedagogic purpose, we first consider the following problem 

mf{{H^Z,,Z^) + {H2Z2,Z2), GR^^ {Z„Z,) = 1, {Z,,Z^) = Q]. (3.1) 

Problem (jH.lj) is a particular occurence of ()2.2j) . We have denoted by (-, ■) the standard 
Euclidean scalar product on R^*". 

For ()3.1|) . the algorithm is defined in the following simplified form. Choose [Z^^Z^) 
satisfying the constraints and construct the sequence {Z^, Z2)ket^ by the following itera- 
tion procedure. Assume {Z^, Z|) is known, then 

• Local step. Solve 

|f = arginf{(ifiZi, Z,), Z, G {Z,, Z,) = 1 {Z,, Zf, = O}, 

Zl = arginf{(i72^2, Z,), Z^ G R^N {Z,, Z^) = 1 (Zf, Z2) = O} ; ^ ' ^ 

Global step. Solve 

a* = arginf{(ifiZi, Z^) + {H2Z2, Z2), ae R} (3.3) 



where 



and set 



Z. . ^ Z£|l±#, (3.4) 



,,,_zt+£a. + (3.5) 

In the fc-th iteration of the local step, we first fix Z2 = Z2 and optimize over Zi 
to obtain Z^. Then we fix Zi = Z^ and optimize over Z2 to obtain Z2. This local 
step monotonically reduces the objective function, however, it may not converge to the 
global optimum. The technical problem is that the Lagrange multipliers associated with 
the constraint (^1,^2) = may converge to different values in the two subproblems 
associated with the local step. In the global step, we optimize the sum {HiZi, Zi) + 
{H2Z2, Z2) over the subspace spanned by Z^ and Z|, subject to the constraints in ()3.1|) . 
The global step again reduces the value of the objective function since Z^ and Z2 are 
feasible in the global step. It can be shown that the combined algorithm (local step + 
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global step) monotonically decreases the objective function and globally converges to an 
optimal solution of ()3.1|) . 

This algorithm operates at two levels: a fine level where we solve two problems of 
dimension A^^^ rather than one problem of dimension 2Nfy; a coarse level where we solve a 
problem of dimension 2. Left by itself, the fine step converges to a suboptimal solution 
of (jtllj) . Combining the fine step with the global step yields convergence to a global 
optimum. 

In addition to providing a pedagogic view on the general algorithm presented in the 
following section, the simplified form ()3.2|) - ()3.5|) has a theoretical interest. In contrast to 
the general algorithm for which we cannot provide a convergence analysis, the simplified 
form ()3.2j) - ()3.5j) may be analyzed mathematically, at least in the particular situation when 
Hi = H2 = H. Then solving ()3.H1 amounts to searching for the lowest two eigenelements 
of the matrix H. Notice that the global step ()3.3|) - ()3.5p is then unnecessary because the 
functional to minimize in ()3.3|) does not depend on a. 

However, we can show that the iterations ()3.2|) converge in the following sense. The 
2-dimensional vector space spanned by the lowest two eigenvalues of H is reached asymp- 
totically. This occurs under an appropriate condition on the matrix H. The latter is a 
condition of separation of the eigenvalues, namely €2 — ei < €3 — €2 with obvious notation. 
The gap €3 — €2 gives the speed of convergence. For brevity, we do not detail the proof 
here (see jS]). Future work on the numerical analysis of more general cases is in progress. 

3.2 Description of the algorithm 

We define, for all p-tuple {Ci)i<i<p, 

p 

^((a)i<.<p) = 5^Tr(i7,C,C*), (3.6) 

i=l 

and set by convention 

Uo = Up = 0. (3.7) 

We introduce an integer e, initialized to one, that will alternate between the values zero 
and one during the iterations. 

At iteration k, we have at hand a set of block sizes (mf)i<i<p and a set of matrices 
{C^)i<i<P such that Cf G A<"''"'(R), [Cf]*Cf = [Cf]*TCf+i = 0. We now explain 
how to compute the new iterate (m^^^)i<j<p, (Cf^^)i<j<p. 

Multilevel Domain Decomposition (MDD) algorithm 
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• Step 1: Local fine solver. 

(a) For each i, diagonalize the matrix if2i+e in the subspace 

^t+e = e R", [Cli+e-l] 'Tx = 0, x'TC',,^,^, = O} , 

i.e. diagonahze P2Ve-^2i+e-P2i+e where P2i+e is the orthogonal projector on 
This provides (at least) n— m2j+g_i— ml^^^^^ real eigenvalues )^2i+e,i — •^2i+e,2 ^ 
• • • and associated orthonormal vectors a;2j+ej- The latter are T-orthogonal to 
the column vectors of Cf_i and Cf+i- 

(b) Sort the eigenvalues {^2i+e,j)i,j i^ increasing order, and select the lowest m2i+e 

i 

of them. For each i, collect in block + e the eigenvalues A2j_|_gj selected. 
New intermediate block sizes rh2i^^ are defined. 

^ 

(c) For each i, collect the lowest ^7121+^ vectors X2i^^j in the n x m2i+e matrix C2j_|_e. 

(d) For each i, diagonalize the matrix H2i+e+i in the subspace 



yk 













* = 0, 



in order to get eigenvalues Aai+^+i^i < A2i+g+i^2 < • • • and associated orthonor- 
mal vectors X2j+e+ij. The latter are T-orthogonal to the column vectors of 



^2i+e ^2i+e+2- 



(e) Sort all the eigenvalues {{X2i_^_^_^_lJ)iJ, {X2i^^ j)ij} in increasing order. Select 
the lowest A^. For each /, collect in block the eigenvalues Xf j selected. New 
intermediate block sizes {m'l~^^)i<i<p are thus defined. 



T*^ I ... I 



(f ) Set Cf 

(g) Replace e by 1 — e and proceed to step 2 below. 
• Step 2: global coeirse solver. Solve 

U* = arginf U = {Ui)i, VI < i < p - 1 Ui e A<™^+i'">(R)}, (3.^ 

where 



f{U)^£(^{c,{U){C,{UYC,{U)) (3-9) 
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and 

C,{U) = + TC^_^^U^ ([C'lfTT'cf^ - T'CtiUti (\C'l]'T*Tcf^ . (3.10) 
Next set, for all 1 < z < p, 

C^+^ = C,{W) (Ci{U*y C,{U*)y^^\ (3.11) 

Note that TCf+Y = (this follows from = 0). 

We think of the even indexed unknowns C2i as the black variables and the odd indexed 
unknowns 6*2^+1 as the white variables. In the first phase of the local fine solver, we 
optimize over the white variables while holding the black variables fixed. In the second 
phase of the local fine solver, we optimize over the black variables while holding the white 
variables fixed. In the global step, we perturb each variable by a linear combination of 
the adjacent variables. The matrices U = {Ui)i in ()3.8j) play the same role as the real 
parameter a in ()3.3p . The perturbation is designed so that the constraints are satisfied. 
The optimization is performed over the matrices generating the linear combinations. In 
the next iteration, we interchange the order of the optimizations: first optimize over 
the black variables while holding the white variables fixed, then optimize over the white 
variables while holding the black variables fixed. 

Let us point out that an accurate solution to ()3.8|1 is not needed. In practice, we 
reduce the computational cost of the global step, by using again a domain decomposition 
method. The blocks (Cj)i<j<p are collected in r overlapping groups (Gz)i</<r as shown 
in Figure El Problem ()3.8|1 is solved first for the blocks (G2/+1), next for the blocks 
{G21). Possibly, this procedure is repeated a few times. The advantage of this strategy 
is that the computational time of the global step scales linearly with A^. In addition, 
it is parallel in nature. The solution of ()3.8p for a given group is performed by a few 
steps of a Newton-type algorithm. Other preconditioned iterative methods could also be 
considered. 

3.3 Comments on the local step 

The local step is based on a checkerboard iteration technique. 
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When e = 1, steps la-lc search for a solution (m2j_,_i, C*2j_|_i)i to the problem 
infj 5;]Tr {H2^+lC2^+lC'2,+,) , e TW^'^^^+nH), C*,+iC2,+i = 

^ i 

[C2iYTC2i+l = 0, (72^+17(72^+2 = 0) 

m2i+i e N, m2i+i = m2i+i > . 

During steps la-lc, the "white" blocks are kept fixed. The "black" blocks C^j+i are 
optimized under the orthogonality constraints imposed by the "white" blocks. A point 
is that most of the computational effort can be done in parallel. Indeed, for p even, say, 
performing step la amounts to solving p/2 independent diagonalisation problems of size n. 

Likewise, steps Id- If solve 

mil 5^Tr {HidCl) , d G M"'"^'(R), Old = m, e JN, ^m, = TV 

1=1 i 

[C2J-1YTC2, = 0, [C2,fT[C2,+iY = 0, 

, ^ I 

< m2j+i < m2j+i, C2j+i C C2J+1 >, 

where the notation C2j+i C (^g^+i means that each column of C2j+i is a column of (72j+i. 
Here again, most of the computational effort can be performed in parallel. 

When e = 1, "black" vectors (i.e. vectors belonging to blocks with odd indices) are 
allowed to become "white" vectors, but the reverse is forbidden. In order to symmetrize 
the process, e is replaced by 1 — e in the next iteration. 

We wish to emphasize that, although called local, this step already accounts for some 
global concern. Indeed, and it is a key point of the local step, substeps (b) and (e) sort 
the complete set of eigenvalues generated locally. This, together with the update of the 
size rrii of the blocks, allows for a preliminary propagation of the information throughout 
the whole system. The global step will complement this. 

Finally, let us mention that in the local steps, (approximate) T-orthogonality is ob- 
tained by a Householder orthonormalization process. The required orthonormality crite- 
rion is 

Vl<i<p-1, \\[C^fTCl,\\ < SL, (3.12) 
where > is a threshold to be chosen by the user. 
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3.4 Comments on the global step 

Let us briefly illustrate the role played by the global step. For simplicity, we consider the 
case of two blocks of same initial size mi = m,2 = m and we assume that mi and m2 do 
not vary during the iterations. If only the local step is performed, then the new iterate 

does not necessarily satisfies ()2.4|) . Indeed, there is no reason why the Lagrange multipliers 
corresponding to the two contraints C^TC^ = (step la when e = 1) on the one hand and 
[C^YTC = (step Id when e = 1) on the other hand should be the same. The global step 
asymptotically enforces the equality of Lagrange multipliers. This is a way to account for 
a global feature of the problem. 

Let us emphasize this specific point. Assume U* = in the global step of the k-th 
iteration of the algorithm, or in other words that the global step is not effective at the k-th 
iteration. Then it implies that the output (Ci,C2) = (C*i,C|) of the local step already 
satisfies fl2.4p . Indeed, 



f{U) = Ti(J,{U)C,{UYH,C,{U)]^Ti(UU)C2{UYH2C2{U) 



(3.13) 



with Ji{U) 



c,{uyc,{u) 
-1 



-1 



for i = 1, 2. Since 



C{TT'CAuHc\T'TC2)U[C\TT'Ci 



-1 



we have VJi(O) = VJ2(0) 
all 1 < i, j < m, 



Im + CITTC2 U C{TTCi W CITTC2 



0. The matrix U being a square matrix of dimension m, for 



1 df 



2 dUr 



(0) 



Tr 



dCi 



(0) 



Tr 



dU, 



(0) 



H2C2 



C{TT'Ci ) C\HiTC2 - C{TH2C2[CIT^TC2 



C{TT'Ci) (Ai - A2) (ClT*TC2 
where Ai and A2 are defined by 

/fiCi = CiEi+T(72A*, 
H2C2 = C2E2 + T*CiA2. 



(3.14) 



(3.15) 
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As f/* = implies 

df 

Vl<z,j<m ^(0) = 0, (3.16) 

we conclude that Ai = A2 if the matrices ^C*TT*Cij and {C2T^TC2^ are invertible, 

which is generally the case when n ^ 2m. Consequently, ()2.4p is satisfied by (Ci, C2). 

On the other hand, when n is not much larger that 2m, the above matrices are not 
invertible and ()2.4|1 is usually not satisfied. In this case, the global step is slightly modified 
in order to recover ()2.4j) and thus improve the efficiency of the global step. We replace 
(nmiD by 

V 1 < i < p, Q(p() = C^ + TC^^^U, ([C'l]Wcf^ - T'CtiUti ([C'l]'T'Tcf^ (3.17) 

where is a block formed by vectors collected in the vector space defined by . These 
vectors are selected using a modified Gram-Schmidt orthonormalization process. The size 
of the blocks is appropriately chosen. The larger the blocks C*f , the more precise the 
global step but the worse the conditioning of the optimization problem. In addition, since 
the global step is the most demanding step of the algorithm, considerations both on the 
computational time and in terms of memory are accounted for when fixing the sizes of 
the blocks Cf . 

Our numerical experiments show that when the global step is performed (using ()3.10p 
or ()3.17p . depending on n and m), the blocks {C^^'^)i do not exactly satisfy the orthonor- 
mality constraint, owing to evident round-off errors. All the linear scaling algorithms have 
difficulties in ensuring this constraint and our MDD approach is no exception. The tests 
performed however show that the constraint remains satisfied throughout the iterations 
within a good degree of accuracy. 

4 Numerical tests 

An extensive set of numerical tests was performed to illustrate the important features of 
the domain decomposition algorithm introduced above, and to compare it with a standard 
scheme, commonly used in large scale electronic structure calculations. 

4.1 Setting of the algorithm and of the tests 

Molecular systems used for the tests Numerical tests on the algorithm presented 
above were performed on three chemical systems. The first two systems both have for- 
mula COH-(CO)„^-COH. They differ in their Carbon-Carbon interatomic distances. For 
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system Vi, this distance is fixed to 5 atomic units, while it is fixed to 4 for system V2- 
On the other hand, our third system, denoted by V3 has formula CH3-(CH2)„„-CH3. 

For each of the three systems Vi, V2, "Ps, several numbers ra^ of monomers were 
considered. A geometry optimization was performed using the GAUSSIAN package [22] 
in order to fix the internal geometrical parameters of the system. The only exception to 
this is the Carbon-Carbon distance for Vi and V2, which, as said above, is fixed a priori. 
Imposing the Carbon-Carbon distance allows to control the sparsity of the matrices H and 
5* (the larger the distance, the sparser the matrices). Although not physically relevant, 
fixing the Carbon-Carbon distance is therefore useful for the purpose of numerical tests. 

Data, parameters and initialization For an extremely large number of monomers, 
the matrices H, S, and cannot be generated directly with the GAUSSIAN package. 
We therefore make a periodicity assumption. For large values of n^, these matrices ap- 
proach a periodic pattern (leaving apart, of course, the "boundary layer", that is the 
terms involving orbitals close to one end of the linear molecule) . So, we first fix some 
sufficiently large, but for which a direct calculation with Gaussian is feasible, and con- 
struct H, S. The matrices H and S, as well as the ground-state density matrix D^,, and 
the ground-state energy Eq, are then obtained for arbitrary large ra^ assuming periodicity 
out of the "boundary layer" . Likewise, the gap 7 in the eigenvalues of H is observed to be 
constant, for each system, irrespective of the number of polymers, supposedly large. 
Proceeding so, the gap for systems Vi, V2, and V3 is respectively evaluated to 0.00104, 
0.00357, and 0.0281. 

For our MDD approach, localization parameters are needed. They are shown in Ta- 
ble below. Additionally, we need to provide the algorithm with an initial guess on the 
size rrii of the blocks. Based on physical considerations on the expected repartition of the 
electrons in the molecule and on the expected localization of the orbitals, the sizes were 
fixed to values indicated in Table 14.11 The specific block Ci is then initialized in one of 
the following three manners: 

• strategy Ti: the entries of C are generated randomly, which of course generically 
yields a bad initial guess way; 

• strategy X2: each block Ci consists of the lowest rui (generalized) eigenvectors as- 
sociated to the corresponding block matrices and Si in the matrices H and S, 
respectively. This provides with an initial guess, depending on the matrices H and 
S, thus of better quahty than the random one provided by strategy Ii, 

• strategy X3: the initial guess provided by I2 is optimized with the local fine solver 
described in section 13.21 
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Vi 


V2 


Vs 


n 


130 


200 


308 


q 


50 


80 


126 


Bandwith of S 


59 


79 


111 


Bandwith of H 


99 


159 


255 


Cut-off for entries of H 


10-12 


10-12 


10-10 


Cut-off for entries of D 


10-" 


10-11 


10-^ 


Size of first block 


nil = 67 


nil = 105 


nil = 136 


Size of last block 


nip = 67 


nip = 106 


nip = 137 


Size of a generic block 


nii = 56 


nii = 84 


nii = 104 



Table 1: Localization parameters and initial size of the blocks used in the tests 



Implementation details Exact diagonalizations in the local steps are performed with 
the routine dsbgv.f from the LAPACK package |lj. In the global step, the resolution of 
the linear system involving the Hessian matrix is performed iteratively, using SYMMLQ 
Diagonal preconditionning is used to speed up the resolution. 
The calculations have been performed using only one processor of a bi-processor Intel 
Pentium IV-2.8 GHz. 



Criteria for comparison of results For assesment of the quality of the results, we 
have used two criteria, regarding the ground-state energy and the ground-state density 
matrix, respectively. For either quantity, the reference calculation is the calculation using 
the Gaussian package . The quality of the energy is measured using the relative error 
I E — £'0 1 

ge = — TTT-j — • For evaluation of the quality of the density matrix, we use the L°° matrix 

\Eo\ 

norm 

Coo = sup Dij - [D^].j , (4.1) 

(i,i)S.t. \H,,\<e 

where we fix e = 10"i°. The introduction of the norm ()4.1|1 is consistent with the cut-off 
performed on the entries of H (thus the exact value of e chosen). Indeed, in practice, the 
matrix D is only used for the calculations of various observables (for instance electronic 
energy and Hellman-Feynman forces), all of the form Tt{AD) where the symmetric matrix 
A shares the same pattern as the matrix H (see j7] for details). The result is therefore 
not sensitive to entries with indices such that \Hij\ is below the cut-off value. 
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4.2 Illustration of the role of the local and global steps 

Our MDD method consists in three ingredients: 

• the local optimization of each block performed in the local step; 

• the transfer of vectors from some blocks to other blocks, along with the modification 
of the block sizes mj, again in the local step; 

• the optimization performed in the global step. 

To highlight the necessity of each of the ingredients, and their impacts on the final result, 
we compare our MDD algorithm with three simplified variants. Let us denote by 

• strategy Si. local optimization of the blocks, without allowing variations of the 
block sizes, and no global step; 

• strategy 1S2: full local step (as defined in Section IT^ . no global step; 

• strategy S3: local optimization of the blocks, without allowing for variations of the 
block sizes, and global step; 

• strategy 1S4: full algorithm. 

We compare the rate of convergence for the above four strategies. Two categories of 
tests are performed, depending on the quality of the initial guess. The results displayed 
on Figures ini to El concern polymer Vi with = 801 monomers. This corresponds to 
Nh = 8050 and N = 5622. Analogous tests were performed on V2 and V3, but we do not 
present them here, for brevity. 

The energy of the ground state of this matrix (i.e. the minimum of ()l.fi|l ) is Eo = 
—27663.484. The number of blocks considered is p = 100. For the global step, we have 
collected these 100 blocks in 99 overlapping groups of 2 blocks. Interestingly, such a 
partition provides with optimal results regarding CPU time and memory requirement. It 
is observed on Fig. that Si, S2 and 1S3 are not satisfactory for they converge towards 
some local, non global, minima of (j2.2p whatever the initial guess. The failure of the 
strategy 5*3 performed on the initial guess X2 is surprising: this initial guess is not good 
enough. Indeed, if the initial guess is I3, we check numerically that the strategies S^ and 
1S4 behave identically. Notice that the strategy I3 is identical to ^2 applied to the initial 
guess I2. 

We also remark that the strategy ^4 performs very well whatever the initial guess (see 
Fig. [7|and Fig. The same behavior is observed for the polymers V2 and V3. Finally, 
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after orthonormalization, the Density Matrix Minimization (DMM) method [18j failed 
with the random initial guess and reveals very slow with the initial guess X2. That is the 
reason why we consider the initial guess X3 to compare these methods. 

4.3 Comparison with two other methods 

Having emphasized the usefulness of all the ingredients of our MDD algorithm, we now 
compare it to two other algorithms: 

• the diagonalization routine dsbgv.f bom the LAPACK library; 

• the Density Matrix Minimization (DMM) method [TS] . 

These two algorithms are seen as prototypical approaches for standard diagonalization 
algorithms and linear scaling techniques respectively. They are only used here for com- 
parison purposes. Regarding linear scaling methods, two other popular approaches are the 
Fermi Operator method ^I] and the McWeeny iteration method [21]. We have observed 
that, at least in our own implementation, based on the literature, they are outperformed 
by the DMM method for the actual chemical systems we have considered. We therefore 
take DMM as a reference method for our comparison. 

Recall that the routine dsbgv.f consists in the three-step procedure 

• transform the generalized eigenvalue problem into a standard eigenvalue problem 
by applying a Cholesky factorization to S; 

• reduce the new matrix to be diagonalized to a tridiagonal form; 

• compute its eigenelements by using the implicit QR method. 

The algorithmic complexity of this approach is in and the required memory scales as 

For the description of DMM method, we refer to ^H]. Let us only mention here that 
this approach consists in a minimization procedure, applied to the energy expressed in 
terms of the density matrix. Both the algorithmic complexity and the memory needed for 
performing the DMM approach scale linearly with respect to the size Nf, of the matrix. 
The DMM method is initialized with the density matrix D = CC^ computed with the 
initial guess C of the domain decomposition method. Two important points for the tests 
shown below are the following. 

First, we perform a cut-off on the coefficients on the various matrices manipulated 
throughout the calculation: only the terms of the density matrices within the frame 
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defined in Figure ID are taken into account. Such a cut-off has some impact on the quahties 
of the results obtained with the DMM method. We are however not able to design a better 
comparison. 

Second, the DMM method requires the knowledge of the Fermi level (as is the case for 
the linear scaling methods commonly used in practice to date). The determination of the 
Fermi level is the purpose of an outer optimization loop. In contrast, the MDD approach 
computes an approximation of the Fermi level at each iteration. Here, for the purpose of 
comparison, we provide DMM with the exact value of the Fermi level. Consequently, the 
CPU times for the DMM method displayed in the sequel are underestimated. 

We emphasize that the routine dsbgv.f computes the entire spectrum of the matrix, 
both eigenvalues and eigenvectors. In contrast, the MDD approach only provides with 
the lowest eigenvalues, among Nb, and the projector on the vector space spanned by 
the corresponding eigenvectors, not the eigenvectors themselves. 

4.3.1 Comparison with Direct diagonalization and DMM 

We have computed the ground states of the polymers Vi, V2 and V3 with the three meth- 
ods (direct diagonalization, DMM and MDD) and for various numbers rim of monomers, 
corresponding to matrix sizes A*";, in the range 10^-10^. 

For DMM and MDD, the initial guess is generated following the strategy I3. The 
results regarding the CPU time at convergence and the memory requirement are displayed 
on Figures ITUl to IT^ for the polymers Vi, V2, and V3 respectively. 

For small values of Nj,, i.e. up to around 10"^, the results observed for the direct 
diagonalization, DMM and MDD agree. The CPU times for our MDD approach scale 
linearly with Nf,. 

For larger values of Nf,, the limited memory prevented us from either performing an 
exact diagonalization or from implementing DMM. So, we extrapolate the CPU time and 
memory requirement according to the scaling observed for smaller Ni,. 

The data for the DMM method are not plotted in Figure El as the DMM method 
does not converge for the polymer V3 when the number of monomers exceeds 10^. From 
our point of view, it comes from the truncation errors which cause the divergence of the 
method (note that the truncation strategy we consider here is very simple). 

4.3.2 Comparison with DMM and a hybrid strategy 

We now concentrate on the two approaches that scale linearly, namely DMM and MDD. 
We consider 

• Vi with 4001 monomers, corresponding to N}, = 40050, 
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• V2 with 2404 monomers, corresponding to A^";, = 24080, 

• V3 with 208 monomers, corresponding to A^^ = 854. 

These particular values have been chosen for the purpose of having simple values for 
the numbers of blocks. For each of the three polymers, we compare the DMM and MDD 
methods initialized by the strategy X3 and a hybrid strategy. The hybrid strategy consists 
of a certain number of iterations performed with MDD, until convergence is reached for 
this method, followed by iterations with DMM. We use the following stopping criterion 
for MDD: 

\\Dn - Dn^iW > - D^_2\\ and - Dn-i\\ < e„ (4.2) 

where is a threshold parameter. We take = 10""^, respectively ea = 10~^, for the 
polymer Vi, respectively V2 and V3. 

The Figures fT!?l to ITHl show the evolution of the error in density versus CPU time. The 
hybrid version is demonstrated to be a very efficient combination of the two algorithms. 

For completeness, let us highlight the temporary increase for the error in density 
appearing in Fig. ITU when MDD is used on V2- Analogously, the energy of the current 
solution, which is actually below the reference energy, also increases. In fact, this is due 
to a loss of precision in the orthonormality constraints. In MDD, these constraints are 
not imposed exactly at each iteration, but only approximately (see equation l3.12|) . 

Finally, we report in figures ^1 to ^1 the results obtained with MDD for the largest 
possible case that can be perfomed on our platform, owing to memory limitation. We 
used the initial guesses obtained with the strategy X3. Notice that for the local step the 
memory requirement scales linearly with respect to the number of monomers, while 
for the global step, the memory requirement is independent of n^- Therefore, for large 
polymers, the memory needed by MDD is controled by the local step. In contrast, for 
small polymers, the most demanding step in terms of memory is the global step. 

5 Conclusions and remarks 

The domain decomposition algorithm introduced above performs well, in comparison to 
the two standard methods considered. More importantly, our approach is an effective pre- 
conditionning technique for DMM iterations. Indeed, MDD provides a rapid and accurate 
approximation, both in terms of energy and density matrix, regardless of the quality of 
the initial guess. In contrast, DMM outperforms MDD when the initial guess is good, but 
only performs poorly, or may even diverge, when this is not the case. The combination of 
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the two methods seems to be optimal. More generally, our MDD algorithm could consti- 
tute a good preconditionner to all variational methods, such as the Orbital Minimization 
method [201 • 

Regarding the comparison with DMM, the following comments are in order. 

• All our calculations have been performed on a single processor machine. Poten- 
tially, both DMM and MDD should exhibit the same speed-up when parallelized. 
We therefore consider the comparison valid, at least qualitatively, for parallel imple- 
mentations. The parallelization of the MDD is currently in progress, and hopefully 
will confirm the efficiency of the approach. 

• We recall the Fermi level has to be provided to the DMM method. This is an 
additional argument in favor of the MDD approach. 

• The MDD method, in contrast to the other linear scaling methods, does not perform 
any truncation in the computations. So, once the profile of C is choosen, the method 
does not suffer of any instabilities, contrary to DMM (or OM) for which divergences 
have been observed for the polymer V3. 

• The domain decomposition method makes use of several threshold parameters. For 
the three polymers we have considered, the optimal values of these parameters, 
except for the stopping criterion ea (equation ()4.2j) ). are the same. We do not know 
yet if this interesting feature is a general rule. 

• Recall our method solves problem ()2.2|) . which is only an approximation of prob- 
lem fl2.1|) . Therefore, the relative error obtained in the limit is only a measure of 
the difference between fl2.2|) and ()2.1|) . In principle, such a difference could be made 
arbitrarily small by an appropriate choice of the parameters of problem ()2.2j) . 

• Finally, let us emphasize that there is much room for improvement in both the local 
and the global steps. We have designed an overall multilevel strategy that performs 
well, but each subroutine may be significantly improved. Another interesting issue 
is the interplay between the nonlinear loop in the Hartree-Fock or Kohn-Sham prob- 
lems (Self- Consistent Field - SCF - convergence [ZIEIE]) and the linear subproblem 
considered in the present article. Future efforts will go in these directions. 
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Figure 2: Block structure of the matrices C. Note that by construction each block only 
overlaps with its nearest neighbors. 
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Nb= (p+1) n/2 
Figure 3: Block structure of the matrix H. 
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Nb= (p+1) n/2 
Figure 4: Block structure of the matrix D. 
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Figure 5: Collection of p = 10 blocks into r = 3 groups. 




Figure 6: Energy error versus CPU time obtained with a bad initial guess 




Figure 7: Density error versus CPU time obtained with a bad initial guess 
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Figure 8: Energy error versus CPU time obtained with a better initial guess 
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Figure 9: Density error versus CPU time obtained with a better initial guess 
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Figure 10: Scaling of the CPU time (top) and memory requirement (bottom) for the 
polymer Vi. 
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Figure 11: Scaling of the CPU time (top) and memory requirement (bottom) for the 
polymer V2- 
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Figure 12: Scaling of the CPU time (top) and memory requirement (bottom) for the 
polymer V3. 
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Figure 13: Evolution of the density error with the CPU time for the polymer Vi made of 
4001 monomers. 
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Figure 14: Evolution of the density error with the CPU time for the polymer V2 made of 
2404 monomers. 
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Figure 15: Evolution of the density error with the CPU time for the polymer V3 made of 
208 monomers. 




Figure 16: Evolution of the MDD energy and density errors versus CPU time for the 
polymer Vi (20001 monomers, A^^e, = 200050). 
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Figure 17: Evolution of the MDD energy and density errors versus CPU time for the 
polymer V2 (12004 monomers, Ni, = 120080). 
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Figure 18: Evolution of the MDD energy and density errors versus CPU time for the 
polymer V3 (5214 monomers, A*";, = 36526). 



